Machine Translation vs. Dictionary Term Translation
Abstract
Bilingual news article alignment methods based on multi-lingual information retrieval have been shown to be successful for the automatic production of so-called noisy-parallel corpora. In this paper we compare the use of machine translation (MT) to the commonly used dictionary term lookup (DTL) method for Reuters news article alignment in English and Japanese. The results show the trade-off between improved lexical disambiguation provided by machine translation and extended synonym choice provided by dictionary term lookup, and indicate that MT is superior to DTL only at medium and low recall levels. At high recall levels DTL has superior precision.
1 Introduction
In this paper we compare the effectiveness of full machine translation (MT) and simple dictionary term lookup (DTL) for the task of English-Japanese news article alignment using the vector space model from multi-lingual information retrieval. Matching texts depends essentially on lexical coincidence between the English text and the Japanese translation, and we see that the two methods show the trade-off between reduced transfer ambiguity in MT and increased synonymy in DTL.
Corpus-based approaches to natural language processing are now well established for tasks such as vocabulary and phrase acquisition, word sense disambiguation and pattern learning. The continued practical application of corpus-based methods is critically dependent on the availability of corpus resources. In machine translation we are concerned with the provision of bilingual knowledge, and we have found that the types of language domains which users are interested in, such as news, current affairs and technology, are poorly represented in today's publicly available corpora. Our main area of interest is English-Japanese translation, but there are few clean parallel corpora available in large quantities. As a result we have looked at ways of automatically acquiring large amounts of parallel text for vocabulary acquisition.
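The matching idea described in the introduction (a vector space comparison that depends on lexical coincidence between translated English terms and a Japanese article) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy dictionary, the romanised Japanese terms, the use of raw term counts rather than any particular term weighting, and the "first listed sense" function that stands in for the single translation an MT system would output. It is not the system evaluated in the paper.

```python
import math
from collections import Counter

# Hypothetical toy bilingual dictionary (romanised Japanese, illustration only).
TOY_DICTIONARY = {
    "bank": ["ginkou", "dote"],      # financial bank / river bank
    "rate": ["kinri", "wariai"],     # interest rate / proportion
    "rise": ["joushou", "hikiage"],  # two near-synonymous renderings
}

def dtl_translate(terms, dictionary):
    """Dictionary term lookup: keep every listed translation of every term."""
    translated = []
    for term in terms:
        translated.extend(dictionary.get(term, []))
    return translated

def single_sense_translate(terms, dictionary):
    """Stand-in for MT output: exactly one translation per known term."""
    return [dictionary[term][0] for term in terms if term in dictionary]

def cosine(a_terms, b_terms):
    """Cosine similarity between two bags of terms using raw counts."""
    a, b = Counter(a_terms), Counter(b_terms)
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

query = ["bank", "rate", "rise"]             # terms from an English article
related = ["ginkou", "kinri", "hikiage"]     # Japanese article on the same story
unrelated = ["dote", "wariai"]               # Japanese article on another topic

mt_terms = single_sense_translate(query, TOY_DICTIONARY)
dtl_terms = dtl_translate(query, TOY_DICTIONARY)

for name, terms in (("single-sense", mt_terms), ("DTL", dtl_terms)):
    print(name, round(cosine(terms, related), 3), round(cosine(terms, unrelated), 3))
```

In this toy setting the DTL query matches the synonym "hikiage" that the single-sense translation misses, but it also matches the unrelated article through the unwanted senses of "bank" and "rate". That is the trade-off between synonymy and transfer ambiguity in miniature.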
The World Wide Web and other Internet resources provide a potentially valuable source of parallel texts. Newswire companies, for example, publish news articles in various languages and various domains every day. We can expect a coincidence of content in these collections of text, but the degree of parallelism is likely to be less than is the case for texts such as the United Nations and parliamentary proceedings. Nevertheless, we can expect a coincidence of vocabulary in the case of names of people and places, organisations and events. This time-sensitive bilingual vocabulary is valuable for machine translation and makes a significant difference to user satisfaction by improving the comprehensibility of the output.
Our goal is to automatically produce a parallel corpus of aligned articles from collections of English and Japanese news texts for bilingual vocabulary acquisition. The first stage in this process is to align the news texts. Previously, Collier et al. (1998) adapted multi-lingual (also called "translingual" or "cross-language") information retrieval (MLIR) for this purpose and showed the practicality of the method. In this paper we extend their investigation by comparing the performance of machine translation and conventional dictionary term translation for this task.
2 MLIR Methods
There has recently been much interest in the MLIR task (Carbonell et al., 1997; Dumais et al., 1996; Hull and Grefenstette, 1996). MLIR differs from traditional information retrieval in several respects which we will discuss below. The most obvious is that we must introduce a translation stage in between matching the query and the texts in the document collection. Query translation, which is currently considered to be preferable to document collection translation, introduces several new factors to the IR task:
• Term transfer mistakes: analysis is far from perfect in today's MT systems and we must con-
Similar resources
When Mariko Talks to Siegfried - Experiences from a Japanese/German Machine Translation Project
In this paper we will report on our experiences from a 2 1/2 year project that designed and implemented a prototypical Japanese to German translation system for titles of Japanese papers.
A Bidirectional, Transfer-Driven Machine Translation System For Spoken Dialogues
This paper presents a brief overview of the bidirectional (Japanese and English) Transfer-Driven Machine Translation system, currently being developed at ATR. The aim of this development is to achieve bidirectional spoken dialogue translation using a new translation technique, TDMT, in which an example-based framework is fully utilized to translate the whole sentence. Although the translation...
CTM: An Example-Based Translation Aid System
This paper describes a Japanese-English translation aid system, CTM, which has a useful capability for flexible retrieval of texts from bilingual corpora or translation databases. Translation examples (pairs of a text and its translation equivalent) are very helpful for us to translate the similar text. Our character-based best match retrieval method can retrieve translation exa...
An English-To-Korean Machine Translator: MATES/EK
This note introduces an English-to-Korean Machine Translation System, MATES/EK, which has been developed as a research prototype and is still under upgrading in KAIST (Korea Advanced Institute of Science and Technology). MATES/EK is a transfer-based system and it has several subsystems that can be used to support other MT-developments. They are grammar developing environment systems...